CLN: break read_json, read_msgpack API, disallow string data input #5954

Closed
wants to merge 1 commit into from

Conversation

@ghost commented Jan 15, 2014

Revisiting #5655 in light of the recent #5874; it's just not right.
Most pandas read_* methods take a path, URL, or file-like object. The newish read_json/read_msgpack
also accept a (byte)string of data and try to guess whether the input is data or a filepath.

That creates weird corner cases and is, in my opinion, sufficiently wrong that it's worth
making a breaking change. Users will now have to wrap their data in BytesIO/StringIO.

Not much of a problem, except perhaps the lost convenience of pd.read_json(j). But since JSON
usually comes from a file or URL, both of which are supported directly, I'm hoping it won't be
that disruptive.
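A minimal sketch of the proposed calling convention, assuming pandas is importable (this is illustrative, not part of the PR's diff); the explicit StringIO wrapper replaces the implicit data-vs-path guessing, which is also the direction pandas itself eventually took when it deprecated literal string input to read_json:

```python
# Callers wrap in-memory data explicitly instead of relying on type guessing.
from io import StringIO

import pandas as pd

j = '{"a": {"0": 1, "1": 2}}'

# Ambiguous form this PR would disallow: pd.read_json(j)
# Explicit form that would remain supported:
df = pd.read_json(StringIO(j))
```

Paths and URLs would keep working unchanged; only raw data strings would need the wrapper.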

For 0.14.0, of course.

@jreback (Contributor) commented Jan 15, 2014

Tell me again why this is a problem?

@ghost (Author) commented Jan 15, 2014

Because it overloads too much. Did the filename typo yielding a JSON parse error, and the 150k
filename, not convince you that this isn't good API design?

@jreback (Contributor) commented Jan 15, 2014

I just think it's convenient...

e.g.

pd.read_json(....) rather than pd.read_json(StringIO(...))

I thought you were going to have a keyword that 'fixed' this, e.g. string= ?

@ghost (Author) commented Jan 15, 2014

I don't remember suggesting that. We've discussed this...?

@jreback (Contributor) commented Jan 15, 2014

No... I was suggesting that (or maybe pd.read_jsons).

@ghost (Author) commented Jan 15, 2014

Yeah, I think the load/loads convention is pretty familiar; that'd be preferable and
more convenient than a keyword anyway. Alternatively, reads_json.
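For reference, the stdlib convention being alluded to: `json.load` takes a file-like object and `json.loads` takes a string, with no guessing between the two.

```python
import json
from io import StringIO

s = '{"x": 1}'

assert json.loads(s) == {"x": 1}           # loads: parse a string
assert json.load(StringIO(s)) == {"x": 1}  # load: read from a file-like object

# No overloading: handing load() a plain string fails loudly
try:
    json.load(s)
except AttributeError:
    pass  # str has no .read()
```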

@jtratner (Contributor) commented

I'd vote for a string= kwarg over anything else, just because it might be clearer
for non-C people. But I don't feel strongly.

@ghost (Author) commented Jan 16, 2014

I don't follow the non-C bit.

Looking at it, read_html goes yet a third way, which is URL/data/file-like.
There's not much consistency in read_*, but I guess they have different usage profiles.

How about we remove data-string input from read_msgpack (it being binary)
and remove filename input from read_json, so it matches read_html? (So read_html(open(foo)),
or a with block.) I don't mind which way the ambiguity is resolved, just that it is.

If we move to load/loads, we'd still have to deal with read_html, which would either
have to break or be inconsistent with everything else. Not great.

Note again that neither `simplejson` nor `msgpack` nor the stdlib `pickle` has this
absurd amount of overloading.

@jreback (Contributor) commented Jan 16, 2014

Related to #5957.

Actually, json/msgpack/html are consistent NOW (csv is not, though).

Why, again, is this a problem? It does the right thing with strings / files.

What are the edge cases you are referring to? A mistyped filename gives a JSON error?

@ghost (Author) commented Jan 16, 2014

I explained the rationale several times already, going over it again isn't likely to help.
Retracted.

@ghost ghost closed this Jan 16, 2014
@ghost ghost deleted the PR_fix_read_json_msgpack branch January 16, 2014 11:26
@ghost (Author) commented Jan 16, 2014

And read_html does not accept filenames except as a URL, so they are not consistent.

@cpcloud (Member) commented Jan 16, 2014

Then that's a bug. Last time I checked, read_html could read files. IIRC there are tests for that.

@ghost (Author) commented Jan 16, 2014


def read_html(io, match='.+', flavor=None, header=None, index_col=None,
              skiprows=None, infer_types=None, attrs=None, parse_dates=False,
              tupleize_cols=False, thousands=','):
    r"""Read HTML tables into a ``list`` of ``DataFrame`` objects.

    Parameters
    ----------
    io : str or file-like
        A URL, a file-like object, or a raw string containing HTML. Note that
        lxml only accepts the http, ftp and file url protocols. If you have a
        URL that starts with ``'https'`` you might try removing the ``'s'``.

So at least the docstring suggests it's reasonable.

Et tu, Brute? sigh

@cpcloud (Member) commented Jan 16, 2014

Oh whoops, I responded too hastily; I'm not sure if read_html accepts plain filenames a la path/to/some/html.html. IIRC resolving that ambiguity wasn't worth it because you can just wrap it in a with open(...) as f: block.

@ghost (Author) commented Jan 16, 2014

Thank you for that. To me it's obvious this should change, but considering there's so much
resistance, I've let it go. Moving on.

@cpcloud (Member) commented Jan 17, 2014

In the interest of completeness, and at the risk of beating a dead horse, read_html does the following before reading in data:

  1. Checks if the string is a url
  2. Not a URL? Checks for the existence of a read attribute (e.g., StringIO and file objects)
  3. Doesn't have a read attribute? Checks if the string is an existing file.
  4. Not a file? Check if it's a raw string
  5. Not a raw string? It's a TypeError.
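The five steps above can be sketched as follows. This is a hypothetical `classify_io` helper, not pandas internals; the name and return values are illustrative:

```python
import os
from urllib.parse import urlparse


def classify_io(obj):
    """Return which branch a read_html-style dispatcher would take."""
    # 1. Does the string look like a URL?
    if isinstance(obj, str) and urlparse(obj).scheme in ("http", "https", "ftp", "file"):
        return "url"
    # 2. Does it have a read attribute (StringIO, open file)?
    if hasattr(obj, "read"):
        return "file-like"
    # 3. Is it the path of an existing file?
    if isinstance(obj, str) and os.path.isfile(obj):
        return "path"
    # 4. Otherwise any string is treated as raw HTML data.
    if isinstance(obj, str):
        return "raw string"
    # 5. Anything else is a TypeError.
    raise TypeError(f"cannot read object of type {type(obj).__name__!r}")
```

Note that step 3 depends on the file actually existing, so a mistyped path silently falls through to the raw-string branch in step 4, which is exactly the filename-typo ambiguity raised earlier in the thread.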

@cpcloud (Member) commented Jan 17, 2014

Of course, if it's a non-existent path then it will be treated as a raw string... which is the issue.

This pull request was closed.
3 participants